Introduction to R Data Analysis - Part 1

Natalie Elphick

January 22nd


Introductions

Natalie Elphick
Bioinformatician I
Bioinformatics Core

Poll 1

What is your level of experience with coding/data analysis?

  1. I am fluent in another data analysis programming language (Python, MATLAB, etc.)
  2. I use Excel to do linear regression
  3. I know some R
  4. All of the above
  5. None of the above

Part 1:

  1. What is R and why should you use it?

  2. The RStudio interface

  3. File types

  4. Error messages

  5. Variables

  6. Types & data structures

    10 min break

  7. Math and logic operations

  8. Functions and libraries

  9. Reading data into R

What is R?

R

  • An open source language developed for statistical computing by Ross Ihaka and Robert Gentleman
  • Inspired by the S language, developed at Bell Labs in 1976 to make interactive data analysis easier
  • The first official version was released in 2000

Why use R for data analysis?

  • R is and will always be free
  • Most statistical analyses can be implemented easily with built-in functions or packages
  • Code serves as a record which enables reproducibility with minimal effort
  • As of March 2023, there were over 19,000 open source packages to extend its functionality
    • Highly customizable graphics (ggplot2)
    • Analysis reports (knitr)
    • RNA-seq analysis (DESeq2)

How does it work?

Programming

RStudio

RStudio

  • RStudio is an integrated development environment (IDE)
  • It makes R code easier to write by providing a feature-rich graphical user interface (GUI)



R and RStudio

Layout

File types

  • R script files that end in .R
    • The most basic type: a plain text file that contains R code
  • R Markdown files that end in .Rmd
  • Let’s create a blank R script to see how they work. Open RStudio and click:
    • File -> New File -> R Script

R Markdown

  • A file format combining R code with Markdown for text formatting.
  • Designed for creating reproducible research reports in various formats (HTML, PDF, Word).
  • Let’s create an Rmd file in RStudio to explore the basics of how they work:
    • File -> New File -> R Markdown

R Markdown Advanced Usage

  • Presentations: Creating slides (like these) with revealjs.
  • Publications: Authoring online books that combine narrative, code, and output with bookdown.
  • Interactive Documents: Developing interactive tutorials with learnr, or dashboards with other embedded applications.

Variables

Variable definition

  • Variables store information that is referenced and manipulated in a computer program
  • In contrast to the mathematical definition of a variable, variables in computer science are mutable
  • There are 3 ways to define variables in R, but one is preferred:
x <- 1  # Preferred way
x = 1
1 -> x
print(x)
## [1] 1
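
To illustrate what "mutable" means here, a minimal sketch that reuses the variable x from above: assigning to an existing name replaces the value it stores.

```r
# Assign a value, then overwrite it: the name x now refers to the new value
x <- 1
x <- x + 1   # the right-hand side is evaluated first, using the old value of x
print(x)
## [1] 2
```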

Variable naming

  • Variable names must start with a letter (or a period not followed by a digit) and can contain letters, digits, underscores, and periods
  • It is best practice to use descriptive variable names and stick to one style of names
# Snake case
dog_breeds <- c("Labrador Retriever","Akita", "Bulldog")

# Period separated
dog.breeds <- c("Labrador Retriever","Akita", "Bulldog")

# Camel case (upper camel case, also called Pascal case)
DogBreeds <- c("Labrador Retriever","Akita", "Bulldog")

Poll 2

Which variable name is not valid in R?

  1. cat_dog
  2. CatDOG
  3. cat.dog
  4. catD0g

Exercise 1

  • Open the R script file part_1.R in RStudio

Data Types and Structures

Data Types

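  • Integer
    • Whole numbers, written with an L suffix (5L)
  • Numeric
    • Decimal numbers (numbers typed without an L suffix are stored as numeric by default)
  • Logical
    • Boolean (TRUE, FALSE)
    • NA (missing value)
  • Character
    • Text strings, written in quotes
    • “A”, “Labrador Retriever”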
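
As a small illustration of the types above, the base R function class() reports the type of a value. The values used here are made up for the example:

```r
# class() reports the type of a value
class(5L)        # the L suffix makes a whole number an integer
## [1] "integer"
class(5)         # numbers without L are stored as numeric, even whole ones
## [1] "numeric"
class(2.5)
## [1] "numeric"
class(TRUE)
## [1] "logical"
class(NA)        # NA on its own is a logical missing value
## [1] "logical"
class("Akita")
## [1] "character"
```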

Data Structures

  • Vectors
    • Atomic vectors - one-dimensional sequences that store values of the same type
    • Lists - can contain different types and structures, including other lists (nested lists)
  • Factors
    • Categorical vectors whose values are restricted to a fixed set of levels (optionally ordered)
  • Matrix
    • Columns and rows of the same type
  • Data frames
    • Columns and rows of mixed types
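
The structures above can be sketched with base R constructors. All names and values below (weights_kg, dog, size, m, dogs) are hypothetical and chosen only for illustration:

```r
# Atomic vector: every element has the same type
weights_kg <- c(30, 42, 24)

# List: elements can have different types, including other lists
dog <- list(breed = "Akita", weight_kg = 42, tags = list("large", "spitz"))

# Factor: categorical values restricted to a fixed set of levels
size <- factor(c("medium", "large", "medium"),
               levels = c("small", "medium", "large"))
levels(size)
## [1] "small"  "medium" "large"

# Matrix: rows and columns, all of one type
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)
## [1] 2 3

# Data frame: each column is a vector, and columns can differ in type
dogs <- data.frame(
  breed = c("Labrador Retriever", "Akita", "Bulldog"),
  weight_kg = weights_kg
)
str(dogs)   # prints a compact summary of the columns and their types
```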

Data structures

Exercise 2: Data Types and Structures

  • Reopen the R script file part_1.R in RStudio

10 min break


Math and Logic Operations

Math & Logic

  • Built-in functions compute common mathematical summaries of data (e.g. mean(), median(), sd()); note that mode() in base R returns an object's storage type, not the most frequent value
  • Relational comparison operators to compare values
x == y  # Equal to
x != y  # Not equal to
x <  y  # Less than
x > y   # Greater than
x <= y  # Less than or equal to
x >= y  # Greater than or equal to

x %in% y # Is x in this vector y?
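
A short sketch of these summaries and comparisons applied to a vector. The vector weights_kg and its values are made up for the example:

```r
# Hypothetical example data
weights_kg <- c(30, 42, 24)

mean(weights_kg)     # arithmetic mean
## [1] 32
median(weights_kg)   # middle value after sorting
## [1] 30

# Comparisons are applied element by element and return logical values
weights_kg > 30
## [1] FALSE  TRUE FALSE

# %in% checks whether the value on the left appears in the vector on the right
"Akita" %in% c("Labrador Retriever", "Akita", "Bulldog")
## [1] TRUE
```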

Logical Operators

  • Logical operators combine or negate TRUE and FALSE values
x <- TRUE
y <- FALSE

!x     # Not x
x | y  # x or y
x & y  # x and y
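
In practice these operators are usually applied to the results of comparisons. A minimal sketch, assuming a made-up value weight_kg:

```r
weight_kg <- 42   # hypothetical value

# & is TRUE only when both sides are TRUE
weight_kg > 20 & weight_kg < 40
## [1] FALSE

# | is TRUE when at least one side is TRUE
weight_kg < 20 | weight_kg > 40
## [1] TRUE

# ! flips TRUE to FALSE and FALSE to TRUE
!(weight_kg > 40)
## [1] FALSE
```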

Conditional execution

  • Relational and logical operations allow for conditional execution of code
if ("Akita" %in% dog_breeds) {
  print("dog_breeds already contains Akita")
} else {
  dog_breeds <- c("Akita", dog_breeds)
}
## [1] "dog_breeds already contains Akita"

Functions

Functions

  • A function is a block of organized, reusable code that is used to perform a single action
  • R has many built-in functions, which are called base R functions
  • Not all arguments are required and some have default values
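
Two base R functions show how optional arguments and default values behave. The input numbers are arbitrary:

```r
# round() has a required argument x and an optional argument digits (default 0)
round(3.14159)
## [1] 3
round(3.14159, digits = 2)
## [1] 3.14

# mean() skips missing values only when na.rm = TRUE is supplied (default is FALSE)
mean(c(1, 2, NA))
## [1] NA
mean(c(1, 2, NA), na.rm = TRUE)
## [1] 1.5
```

Running ?round or ?mean in the console opens the help page that lists each argument and its default.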

Functions

Defining a function

  • To define a function we use the function keyword; the output is specified with return():
add_dog <- function(dog_to_add,
                    input_vector) {
  if (dog_to_add %in% input_vector) {
    
    message("Already contains this dog")
    
  } else {
    
    output <- c(dog_to_add, input_vector)
    return(output)
    
  }
}

Example

add_dog(dog_to_add = "Akita",
        input_vector = dog_breeds)
## Already contains this dog
add_dog(dog_to_add = "German Shepherd",
        input_vector = dog_breeds)
## [1] "German Shepherd"    "Labrador Retriever" "Akita"             
## [4] "Bulldog"
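
One thing to watch: when the dog is already present, add_dog() only prints a message and returns NULL, so assigning its result back to dog_breeds would overwrite the vector. A possible variant is sketched below. The name add_dog_safe is hypothetical and not part of the workshop code:

```r
# Hypothetical variant of add_dog(): always returns a vector
add_dog_safe <- function(dog_to_add, input_vector) {
  if (dog_to_add %in% input_vector) {
    message("Already contains this dog")
    return(input_vector)        # hand back the input unchanged
  }
  c(dog_to_add, input_vector)   # the last evaluated value is returned
}

dog_breeds <- c("Labrador Retriever", "Akita", "Bulldog")
dog_breeds <- add_dog_safe("Akita", dog_breeds)
## Already contains this dog
length(dog_breeds)
## [1] 3
```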

Packages

Packages

  • Packages are collections of functions that are specialized to a specific task (plotting, data manipulation etc.)
  • The tidyverse is a collection of commonly used data analysis packages
    • Learning curve is less steep
    • Lots of useful packages for data analysis
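
A minimal sketch of the usual package workflow: install once, then load in each session. The install line is commented out so that running the script does not download anything, and the variable name has_tidyverse is made up for this example:

```r
# Install once per computer (requires an internet connection)
# install.packages("tidyverse")

# requireNamespace() checks whether a package is installed without attaching it
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

if (has_tidyverse) {
  library(tidyverse)   # attach the package in every new R session
} else {
  message("tidyverse is not installed; run install.packages(\"tidyverse\") first")
}

# package::function() calls a single function without attaching the whole package
stats::median(c(1, 5, 9))
## [1] 5
```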

tidyverse

End of Part 1

Workshop survey

  • Please fill out our workshop survey so we can continue to improve these workshops

Upcoming Workshops

  1. Introduction to Statistics, Experimental Design, and Hypothesis Testing
    • Jan 25, 2024 (Session 1 - 10am–12pm) (Session 2 - 1pm–3pm)
    • Jan 26, 2024 (Session 3 - 10am–12pm)
  2. Intermediate RNA-Seq Analysis Using R
    • Feb 1, 2024 (9:30am-12:00pm)

ChatGPT Tips for R

General Tips

  • Always confirm ChatGPT’s outputs are correct
  • Provide as much detail as possible about the problem in the 1st prompt
  • Use separate chats for separate tasks/projects
  • Try the ‘Custom Instructions’ function that adds additional information to every prompt
  • ChatGPT can visit webpages (GPT-4 only), which can help it give more specific answers

Code Tips

  • Commented R code yields better responses in my experience
  • Provide the code and error message in the same prompt
  • ChatGPT can work well to convert syntax and improve your code:
    • “Turn this loop into a function : [your code]”
    • “Is there a better way to do this : [your code]”
  • Check out the file: example_code/1_convert_syntax_example.R for an example use case

Finding R Packages

Key Questions

  • What assay was the package designed for?
  • When was the last release?
  • Is it maintained (frequent updates)?
  • Does it work on all operating systems?
  • Are other people using it? (citations)
  • Do the maintainers respond to GitHub issues?
  • Is there a benchmarking paper?

Bioconductor and CRAN

  • Both of these have stringent requirements for the packages they host (e.g. Bioconductor packages have to run on all major operating systems)

  • Prefer Bioconductor packages over CRAN packages when both are available

  • Prefer CRAN packages over ones only hosted on GitHub

Start with the Assay

  • Go to the Bioconductor BiocViews page
  • Pick the assay you want to analyse
  • Pick the type of analysis you want to do
  • Find a package that does it
  • Find benchmarking papers to narrow the list of packages down
  • Find the vignette on the package page and refer to the manual for any questions not covered by it

Additional Resources

R

Statistics

RNA-seq Analysis

Dimensional Reduction